Audio-visual speech fragment decoding

Authors

  • Jon Barker
  • Xu Shao
Abstract

This paper presents a robust speech recognition technique called audio-visual speech fragment decoding (AV-SFD), in which the visual signal is exploited both as a cue for source separation and as a carrier of phonetic information. The model builds on the existing audio-only SFD technique which, following the auditory scene analysis account of perceptual organisation, combines a bottom-up layer that identifies sound fragments with a model-driven layer that searches for fragment groupings that can be interpreted as recognisable speech utterances. In AV-SFD, the visual signal is used in the model-driven stage, improving the decoder's ability to distinguish between foreground and background fragments. The system has been evaluated on an audio-visual version of the PASCAL Speech Separation Challenge. At low SNRs, recognition error rates are reduced by around 20% relative to a conventional multistream AV-ASR system.
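The two-layer organisation described above can be illustrated with a minimal sketch: the bottom-up layer is assumed to have already produced a set of spectro-temporal fragments, each with a log-likelihood under a speech (foreground) model and under a background model, and the model-driven layer then searches over foreground/background labellings for the best-scoring segregation. The fragment scores below are illustrative numbers only, and the exhaustive search stands in for the full Viterbi-based decoding of the actual SFD system.

```python
from itertools import product

# Hypothetical fragments from the bottom-up layer. The log-likelihood
# values are illustrative only, not taken from the paper.
fragments = [
    {"id": "f1", "speech_ll": -2.0, "background_ll": -5.0},
    {"id": "f2", "speech_ll": -6.0, "background_ll": -1.5},
    {"id": "f3", "speech_ll": -1.0, "background_ll": -4.0},
]

def best_segregation(fragments):
    """Exhaustively search all foreground/background labellings and
    return the labelling with the highest total log-likelihood.
    Real SFD folds this search into decoding against word models;
    here the model-driven layer is reduced to per-fragment scores."""
    best, best_score = None, float("-inf")
    for labels in product(("speech", "background"), repeat=len(fragments)):
        score = sum(
            f["speech_ll"] if lab == "speech" else f["background_ll"]
            for f, lab in zip(fragments, labels)
        )
        if score > best_score:
            best, best_score = labels, score
    return best, best_score

labels, score = best_segregation(fragments)
```

With these scores the search assigns f1 and f3 to the foreground and f2 to the background; in AV-SFD the visual stream would shift the foreground scores, biasing this search toward fragments consistent with the observed lip movements.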


Similar articles

Using twin-HMM-based audio-visual speech enhancement as a front-end for robust audio-visual speech recognition

In this paper we propose the use of the recently introduced twin-HMM-based audio-visual speech enhancement algorithm as a front-end for audio-visual speech recognition systems. This algorithm determines the clean speech statistics in the recognition domain based on the audio-visual observations and transforms these statistics to the synthesis domain through the so-called twin HMMs. The adopted fr...


Efficient likelihood computation in multi-stream HMM based audio-visual speech recognition

Multi-stream hidden Markov models have recently been introduced in the field of automatic speech recognition as an alternative to single-stream modeling of sequences of speech informative features. In particular, they have been very successful in audio-visual speech recognition, where features extracted from video of the speaker’s lips are also available. However, in contrast to single-stream m...
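The stream combination that multi-stream HMMs rely on can be sketched in a few lines: each stream contributes a log-likelihood at an HMM state, and the streams are fused via weighted addition in the log domain, with a stream exponent controlling the relative trust in audio versus video. The weight and likelihood values below are illustrative assumptions, not values from the paper.

```python
def combined_log_likelihood(audio_ll, video_ll, weight=0.7):
    """Multi-stream HMM state score: weighted sum of per-stream
    log-likelihoods. `weight` is the audio stream exponent; the
    default of 0.7 is an illustrative choice (in practice it is
    tuned, often as a function of the acoustic SNR)."""
    return weight * audio_ll + (1.0 - weight) * video_ll

# At one hypothetical HMM state/frame:
score = combined_log_likelihood(audio_ll=-3.2, video_ll=-1.8)
```

Lowering the audio weight as noise increases lets the video stream dominate the combined score, which is exactly the regime where efficient likelihood computation across both streams matters.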


Audio-visual speech recognition system for a robot

Automatic Speech Recognition (ASR) for a robot should be robust to noise because robots work in noisy environments. Audio-Visual (AV) integration is one of the key ideas for improving robustness in such environments. This paper proposes AV integration for a robot ASR system, applying AV integration to both Voice Activity Detection (VAD) and speech decoding. In VAD, we apply AV-integr...


Speech-to-lip movement synthesis based on the EM algorithm using audio-visual HMMs

This paper proposes a method to re-estimate output visual parameters for speech-to-lip movement synthesis using audio-visual Hidden Markov Models (HMMs) under the Expectation-Maximization (EM) algorithm. In the conventional methods for speech-to-lip movement synthesis, there is a synthesis method estimating a visual parameter sequence through the Viterbi alignment of an input acoustic speech sign...


NTCD-TIMIT: A New Database and Baseline for Noise-Robust Audio-Visual Speech Recognition

Although audio-visual speech is well known to improve the robustness properties of automatic speech recognition (ASR) systems against noise, the realm of audio-visual ASR (AV-ASR) has not gathered the research momentum it deserves. This is mainly due to the lack of audio-visual corpora and the need to combine two fields of knowledge: ASR and computer vision. This paper describes the NTCD-TIMIT ...



Journal:

Volume   Issue

Pages  -

Publication date: 2007